I am looking for a specialist with extensive experience in AI to develop a Windows Service in C# that does the following: every day, visit a list of approximately 800 URLs of real estate agency websites and navigate through the pages to find properties newly listed by the agencies.
Next, these property pages must be read and the relevant data extracted and stored in a fixed format in tables on an SQL server.
A number of data fields are mandatory, such as:
- The direct URL of the property page within the real estate agency's website (to enforce uniqueness)
- The city where the property is located
- The street where the property is located
- The property type, where the choice comes from our fixed list: entire home, apartment, studio, etc. The engine must select the closest match from our list
- The number of rooms
- The monthly rental price
- Whether this price includes or excludes service charges
- The date the property is available
- The surface area in square meters
- A list of URLs of the photos associated with the property

Additionally, there is a list of optional fields we would like to retrieve if the information is available:
- Municipality
- District
- Postal code
- House number
- Number of bedrooms
- Number of bathrooms
- Year of construction
- Is there a: garden, garage, rooftop terrace, balcony?
- Condition of the property
- Is the property furnished?
- ...and so on

A complete list will be provided.
The first challenge is that each real estate agency uses a different paging method and different page layouts. Furthermore, some agencies include all the information in one block of text, while others display much of the data in columns. These layouts can also change without notice. Therefore, the software must be resilient and able to work out for itself how to navigate through the pages to find new properties.
A second challenge is that some agencies include photos of other nearby properties under the details of a specific property. The tool must recognize that these photos do not belong to the property in question and should ignore them.
For cost reasons, we would prefer an AI model that does not rely on a commercial API, unless a commercial API offers benefits significant enough to justify the expense.
I would love to hear about your experience and how you would approach this. Specifically: which AI method/engine you would use and the flow of the software.